Enable Cross layers KV cache layout at NIXL Connector #30207
NickLucche merged 62 commits into vllm-project:main from
Conversation
Code Review
This pull request enables the use of a continuous cross-layer KV cache layout with the NIXL connector, which is a great performance optimization. The changes correctly signal the preference for this layout and add the necessary hooks to handle it.
However, I've found a critical issue in the implementation within NixlConnectorWorker.register_kv_caches. The existing logic does not correctly handle the shape of the cross-layer KV cache tensor, leading to incorrect calculation of num_blocks and num_layers. This will likely cause the KV cache transfer to fail for multi-layer models. I've added a detailed comment on this.
Once this issue is addressed, the PR should be in good shape.
Hi @liranschour, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits,
💡 Codex Review
Here are some automated review suggestions for this pull request.
cross_layer_name = "ALL_LAYERS"

kv_caches = {cross_layer_name: kv_cache}
self.connector_worker.cross_layers = True
self.connector_worker.register_kv_caches(kv_caches)
Cross-layer caches break post-processing for heterogeneous blocks
The new cross-layer registration stores a single combined KV tensor under one key and sets cross_layers = True, but downstream post-processing still assumes split K/V caches whenever kv_topo.split_k_and_v is true. In the heterogeneous block-size path (block_size_ratio > 1) the code in blocksize_post_process/permute_device_kv indexes device_kv_caches as if each value were a (k,v) tuple, so with the cross-layer tensor it iterates over its first dimension and sample_cache = …[0] becomes a single block rather than a cache tensor. That path will raise shape errors or silently corrupt transfers as soon as heterogeneous block sizes are encountered, which are explicitly supported by this connector. The cross-layer registration should either propagate a non-split layout or adjust the post-processing to handle the combined tensor.
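The failure mode described here can be illustrated with a small standalone sketch (shapes are illustrative only, and numpy stands in for torch tensors; this is not vLLM's actual post-processing code):

```python
import numpy as np  # stands in for torch tensors in this illustration

# Split layout: each dict value is a (K, V) pair of per-layer cache tensors.
split_caches = {"layer0": (np.zeros((4, 16, 8)), np.zeros((4, 16, 8)))}

# Cross-layer layout: one combined tensor under a single key,
# e.g. (num_blocks, 2 * num_layers, block_size, head_dim).
combined_caches = {"ALL_LAYERS": np.zeros((4, 32, 16, 8))}

def sample_k_shape(device_kv_caches):
    # Post-processing written for the split layout takes the first element
    # as the K cache. With the combined tensor, cache[0] is instead a
    # single *block* slice, not a K cache tensor.
    cache = next(iter(device_kv_caches.values()))
    return cache[0].shape

print(sample_k_shape(split_caches))     # a K cache: (4, 16, 8)
print(sample_k_shape(combined_caches))  # a block slice: (32, 16, 8)
```

With the split layout the sampled shape is a cache tensor's; with the combined tensor the same indexing silently yields a block slice, which is why the heterogeneous block-size path breaks.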
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Force-pushed d75ca47 to 0f36888
Force-pushed b3b5d45 to 0f36888
Signed-off-by: Liran Schour <lirans@il.ibm.com>
orozery
left a comment
Looks good overall. Thanks!
Also wanted to share a summary I prepared on how this change affects the I/O size for different models:
| Model | Old I/O size | New I/O size |
|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B (tensor_parallel_size=2) | 16 KB | 2 MB |
| deepseek-ai/DeepSeek-V2-Lite-Chat (GPU block size=64) | 72 KB | 1.9 MB |
| meta-llama/Llama-3.1-8B-Instruct | 32 KB | 2 MB |
| meta-llama/Llama-3.2-1B-Instruct | 16 KB | 0.5 MB |
| meta-llama/Llama-3.1-70B-Instruct | 8 KB | 1.25 MB |
| mistralai/Mistral-7B-Instruct-v0.2 | 32 KB | 2 MB |
| mistralai/Mistral-Small-24B-Instruct-2501 | 32 KB | 2.5 MB |
| Qwen/Qwen2.5-3B-Instruct | 8 KB | 0.56 MB |
| Qwen/Qwen3-0.6B | 32 KB | 1.75 MB |
| Qwen/Qwen2.5-7B-Instruct | 16 KB | 0.87 MB |
| Qwen/Qwen3-4B-Instruct-2507 | 32 KB | 2.25 MB |
| Qwen/Qwen2.5-1.5B-Instruct | 8 KB | 0.44 MB |
| Qwen/Qwen3-8B | 28 KB | 1.97 MB |
| Qwen/Qwen3-1.7B | 32 KB | 1.75 MB |
| Qwen/Qwen3-32B (tensor_parallel_size=2) | 16 KB | 2 MB |
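The table above follows from descriptor granularity: with the split layout, one NIXL descriptor covers a single block of one K (or V) cache for one layer, while the cross-layer layout lets a single descriptor span K and V for every layer. A rough back-of-the-envelope sketch (model parameters hardcoded for illustration, assuming 16-token blocks and 2-byte dtype):

```python
def io_sizes_kb(num_layers, num_kv_heads, head_dim, block_size=16, dtype_bytes=2):
    # Split layout: one descriptor = one block of K (or V) for one layer.
    per_layer_kv = block_size * num_kv_heads * head_dim * dtype_bytes
    # Cross-layer layout: one descriptor spans K and V for all layers.
    combined = per_layer_kv * 2 * num_layers
    return per_layer_kv / 1024, combined / 1024

# meta-llama/Llama-3.1-8B-Instruct: 32 layers, 8 KV heads, head_dim 128
old_kb, new_kb = io_sizes_kb(32, 8, 128)
print(f"{old_kb} KB -> {new_kb / 1024} MB")  # 32.0 KB -> 2.0 MB
```

The Llama-3.1-8B numbers (32 KB old, 2 MB new) match the table; Qwen3-0.6B (28 layers, same head geometry) similarly comes out to 32 KB old and 1.75 MB new.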
if vllm_config.kv_transfer_config is None:
    raise ValueError("kv_transfer_config must be set for NixlConnector")
self.kv_transfer_config = vllm_config.kv_transfer_config
self.cross_layers = False
I think that we should move this variable into TpKVTopology.
Then TpKVTopology.split_k_and_v can return False if we are in the cross-layers case.
This should also address the bug that gemini pointed out in the current revision of this PR.
This change means we should defer the initialization of TpKVTopology from NixlConnectorWorker.__init__ to NixlConnectorWorker.register_kv_caches.
Done. Moved this logic under TpKVTopology.
# TODO (NickLucche): Get kernel_block_size in a cleaner way
# NHD default "view" for non-MLA cache
-        if self.device_type == "cpu":
+        if self.device_type == "cpu" or self.cross_layers:
             block_size_position = -2
         else:
             block_size_position = -2 if self.use_mla else -3
In line with my suggestion to eliminate the self.cross_layers variable,
I suggest moving this entire logic behind a new property, TpKVTopology.block_size_position.
cc @NickLucche
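Folded into the topology object, the suggestion could look roughly like this (a sketch with assumed names; the device-type check is kept as a plain parameter here per the later comment that device_type is connector-specific):

```python
class TpKVTopology:
    def __init__(self, use_mla: bool, cross_layers: bool):
        self.use_mla = use_mla
        self.cross_layers = cross_layers

    def block_size_position(self, is_cpu: bool) -> int:
        # CPU caches and the combined cross-layer tensor keep block_size
        # second-from-last; the non-MLA GPU NHD view has it third-from-last.
        if is_cpu or self.cross_layers:
            return -2
        return -2 if self.use_mla else -3

print(TpKVTopology(use_mla=False, cross_layers=True).block_size_position(False))  # -2
```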
Done. Moved this logic under TpKVTopology.
# Attention backend and KV cache dtype affect memory layout
"attn_backend_name": attn_backend_name,
"cache_dtype": str(cache_config.cache_dtype),
"cross_layers": CROSS_LAYERS,
An option to consider:
Defer the computation of self.compat_hash to NixlConnectorWorker.register_kv_caches.
This will allow you to conditionally set this new cross_layers compatibility constraint only if cross layers is actually being used.
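The deferred variant could be sketched like this (helper names, hash fields, and the layout-detection heuristic are assumptions for illustration, not vLLM's actual implementation):

```python
import hashlib
import json

class NixlConnectorWorker:
    def __init__(self):
        # Deferred: computed once the KV cache layout is known.
        self.compat_hash: str | None = None

    def register_kv_caches(self, kv_caches: dict) -> None:
        fields = {"attn_backend_name": "FLASH_ATTN", "cache_dtype": "auto"}
        if self._is_cross_layers(kv_caches):
            # Constrain handshake compatibility only when the
            # cross-layer layout is actually in use.
            fields["cross_layers"] = True
        self.compat_hash = hashlib.sha256(
            json.dumps(fields, sort_keys=True).encode()
        ).hexdigest()

    @staticmethod
    def _is_cross_layers(kv_caches: dict) -> bool:
        return len(kv_caches) == 1 and next(iter(kv_caches)) == "ALL_LAYERS"

w = NixlConnectorWorker()
w.register_kv_caches({"ALL_LAYERS": None})
print(len(w.compat_hash))  # 64 hex chars
```

This way two peers that both use the split layout keep identical hashes, and the extra constraint only kicks in when a combined cache is registered.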
I deferred the compatibility computation to NixlConnectorWorker.register_kv_caches
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Force-pushed 4bf547e to 2a20197
NickLucche
left a comment
Thanks for the work @liranschour and @orozery !
Planning to review/run some benchmarks ASAP; in the meantime, can we run tp_config_sweep.sh to verify heteroTP correctness with this layout? Thanks!
Also cc @xuechendi
@@ -74,6 +74,7 @@
 # 2: Add remote_request_id to kv_transfer_params
 #
 NIXL_CONNECTOR_VERSION: int = 2
+CROSS_LAYERS: bool = True
I will remove this. It is a leftover from previous version of the code.
@NickLucche I was able to run tp_config_sweep_accuracy_test.sh successfully only on Qwen/Qwen3-0.6B:
- "GPU_MEMORY_UTILIZATION=0.6 PREFILLER_TP_SIZE=2 DECODER_TP_SIZE=2"
- "GPU_MEMORY_UTILIZATION=0.6 PREFILLER_TP_SIZE=1 DECODER_TP_SIZE=2"
DeepSeek models are also failing in my environment on the main branch, so I probably have an issue in my environment.
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Force-pushed 85d5cf0 to 073b30e
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Force-pushed 67d23fe to b403a9e
attn_backend: type[AttentionBackend]
engine_id: str
remote_block_size: dict[str, int]
cross_layers: bool
nit: maybe change to cross_layers_blocks / cross_layers_kv_cache or something else that is a bit more self-explanatory?
Also applies for the other cross_layers introduced in this PR.
return self.remote_block_size[self.engine_id]

def block_size_position(self, device_type: str) -> int:
    if device_type == "cpu" or self.cross_layers:
I think that device_type is nixl-connector specific. Can you move it out from here to the nixl-connector?
logger.debug("Detected attention backend %s", self.backend_name)
logger.debug("Detected kv cache layout %s", self.kv_cache_layout)

self.compat_hash = compute_nixl_compatibility_hash(
I think it's better to leave initialization to None here, so that all fields are declared in __init__.
See for example how it's done in gpu_model_runner.py:
# Lazy initializations
# self.model: nn.Module # Set after load_model
# Initialize in initialize_kv_cache
self.kv_caches: list[torch.Tensor] = []
# Initialize in initialize_kv_cache_tensors
self.cross_layers_kv_cache: torch.Tensor | None = None
self.cross_layers_attn_backend: type[AttentionBackend] | None = None
def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
    """Register the KV Cache data in nixl."""

    backend = get_attn_backend(
nit: introduce self.attn_backend in __init__ and re-use it here?
is_mla=self.use_mla,
total_num_kv_heads=self.model_config.get_total_num_kv_heads(),
attn_backend=backend,
cross_layers=next(iter(kv_caches)) == "ALL_LAYERS",
There is a small edge-case if a model has a layer named ALL_LAYERS :)
In the offloading connector we detect a cross layers kv cache differently:
vllm/vllm/v1/kv_offload/worker/cpu_gpu.py
Lines 197 to 205 in 3327807
Actually, maybe you can introduce TpKVTopology.tensor_shape initialized from the kv_caches tensor shape, and then set _cross_layers in TpKVTopology.__post_init__ ?
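Shape-based detection could look roughly like this (the heuristic and axis layout are assumptions for illustration, not the offloading connector's actual check; numpy stands in for torch):

```python
import numpy as np  # stands in for torch tensors

def detect_cross_layers(kv_caches: dict, num_layers: int) -> bool:
    # Assumed heuristic: a single registered tensor whose shape carries
    # the model's layer count indicates the combined cross-layer layout,
    # independent of the key it happens to be registered under.
    if len(kv_caches) != 1:
        return False
    tensor = next(iter(kv_caches.values()))
    return num_layers in tensor.shape and tensor.ndim >= 4

# Combined tensor: (num_blocks, num_layers, block_size, head_dim)
print(detect_cross_layers({"ALL_LAYERS": np.zeros((4, 32, 16, 8))}, 32))  # True
```

Keying off the tensor shape rather than the "ALL_LAYERS" name sidesteps the layer-name collision edge case.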
self.enforce_compat_hash = self.kv_transfer_config.get_from_extra_config(
    "enforce_handshake_compat", True
)
This can be moved back to __init__.
attn_backend=backend,
cross_layers=next(iter(kv_caches)) == "ALL_LAYERS",
)
self._use_pallas = self.kv_topo._use_pallas
We can either move this back to __init__ (setting it using attn_backend == AttentionBackendEnum.PALLAS),
or remove it altogether and switch self._use_pallas to self.kv_topo.use_pallas (new property).
Force-pushed ad45e7c to 02718ad
Purpose
Enable the NIXL Connector to use the new continuous cross-layer KV cache layout described in RFC #27742 and implemented in #27743.
Demonstrate a performance improvement of more than 2x in Tok/sec and TTFT, due to a dramatic reduction in transfer-buffer fragmentation.
Tested with P != D using run_accuracy_test.sh (P=1, D=2).
Tested on H100 1P1D Qwen/Qwen3-0.6B: